This dataset contains information about all aspects of each flight, such as departure time, arrival time, departure airport, and arrival airport.
!pip install plotly==4.14.3
Requirement already satisfied: plotly==4.14.3 in /opt/anaconda3/lib/python3.8/site-packages (4.14.3) Requirement already satisfied: retrying>=1.3.3 in /opt/anaconda3/lib/python3.8/site-packages (from plotly==4.14.3) (1.3.3) Requirement already satisfied: six in /opt/anaconda3/lib/python3.8/site-packages (from plotly==4.14.3) (1.15.0)
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
%matplotlib inline
df_flight=pd.read_csv('Flights_Dataset_2000.csv')
df_flight.head()
| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | 1 | 28 | 5 | 1647.0 | 1647 | 1906.0 | 1859 | HP | 154 | ... | 15 | 11 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 2000 | 1 | 29 | 6 | 1648.0 | 1647 | 1939.0 | 1859 | HP | 154 | ... | 5 | 47 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 2000 | 1 | 30 | 7 | NaN | 1647 | NaN | 1859 | HP | 154 | ... | 0 | 0 | 1 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 2000 | 1 | 31 | 1 | 1645.0 | 1647 | 1852.0 | 1859 | HP | 154 | ... | 7 | 14 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4 | 2000 | 1 | 1 | 6 | 842.0 | 846 | 1057.0 | 1101 | HP | 609 | ... | 3 | 8 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
5 rows × 29 columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
df_flight.sample(20)
| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | TailNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3300270 | 2000 | 8 | 18 | 5 | 835.0 | 835 | 900.0 | 905 | WN | 172 | N624 | 145.0 | 150.0 | 133.0 | -5.0 | 0.0 | HOU | PHX | 1020 | 2 | 10 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1506567 | 2000 | 4 | 8 | 6 | 915.0 | 915 | 1203.0 | 1200 | WN | 1305 | N719 | 168.0 | 165.0 | 154.0 | 3.0 | 0.0 | ISP | TPA | 1034 | 5 | 9 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3951304 | 2000 | 9 | 18 | 1 | 1505.0 | 1505 | 1554.0 | 1557 | NW | 845 | N501US | 109.0 | 112.0 | 77.0 | -3.0 | 0.0 | DTW | MSP | 528 | 11 | 21 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4858781 | 2000 | 11 | 28 | 2 | 732.0 | 735 | 1129.0 | 1141 | US | 1934 | N525AU | 177.0 | 186.0 | 160.0 | -12.0 | -3.0 | IAH | PHL | 1324 | 4 | 13 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3529017 | 2000 | 8 | 20 | 7 | 1023.0 | 1014 | 1154.0 | 1156 | AS | 309 | N791AS | 91.0 | 102.0 | 77.0 | -2.0 | 9.0 | SFO | PDX | 550 | 3 | 11 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1563950 | 2000 | 4 | 2 | 7 | 739.0 | 745 | 812.0 | 821 | NW | 981 | N8925E | 93.0 | 96.0 | 82.0 | -9.0 | -6.0 | MBS | MSP | 463 | 3 | 8 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 1657824 | 2000 | 4 | 10 | 1 | 1238.0 | 1240 | 1432.0 | 1423 | NW | 1479 | N751NW | 114.0 | 103.0 | 94.0 | 9.0 | -2.0 | PHL | DTW | 453 | 7 | 13 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3870637 | 2000 | 9 | 20 | 3 | 1015.0 | 1015 | 1127.0 | 1130 | WN | 2060 | N375 | 72.0 | 75.0 | 59.0 | -3.0 | 0.0 | LAX | SMF | 373 | 5 | 8 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 3598939 | 2000 | 8 | 5 | 6 | 2011.0 | 2010 | 2248.0 | 2235 | DL | 341 | N1738D | 337.0 | 325.0 | 311.0 | 13.0 | 1.0 | LAX | HNL | 2556 | 7 | 19 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2523384 | 2000 | 6 | 3 | 6 | 940.0 | 940 | 1110.0 | 1120 | WN | 135 | N737 | 210.0 | 220.0 | 191.0 | -10.0 | 0.0 | MCI | OAK | 1489 | 10 | 9 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4647583 | 2000 | 10 | 14 | 6 | 1851.0 | 1856 | 2033.0 | 2049 | US | 1057 | N116UW | 102.0 | 113.0 | 87.0 | -16.0 | -5.0 | CLT | BDL | 644 | 3 | 12 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 5461501 | 2000 | 12 | 15 | 5 | 2042.0 | 1745 | 2224.0 | 1949 | DL | 247 | N954DL | 102.0 | 124.0 | 80.0 | 155.0 | 177.0 | ATL | DTW | 594 | 11 | 11 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4631227 | 2000 | 10 | 27 | 5 | 759.0 | 800 | 1039.0 | 1038 | DL | 305 | N104DA | 160.0 | 158.0 | 105.0 | 1.0 | -1.0 | LGA | ATL | 761 | 14 | 41 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2558176 | 2000 | 6 | 3 | 6 | 943.0 | 840 | 1229.0 | 1144 | UA | 1198 | N982UA | 106.0 | 124.0 | 92.0 | 45.0 | 63.0 | ORD | ORF | 717 | 3 | 11 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 430432 | 2000 | 1 | 15 | 6 | 1932.0 | 1937 | 2134.0 | 2145 | US | 1850 | N336US | 122.0 | 128.0 | 99.0 | -11.0 | -5.0 | MIA | CLT | 650 | 5 | 18 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 5683035 | 2000 | 12 | 2 | 6 | NaN | 613 | NaN | 945 | AA | 705 | UNKNOW | NaN | 272.0 | NaN | NaN | NaN | BOS | DFW | 1562 | 0 | 0 | 1 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 4156723 | 2000 | 9 | 16 | 6 | 1452.0 | 1453 | 2121.0 | 2124 | HP | 2614 | N618AW | 209.0 | 211.0 | 192.0 | -3.0 | -1.0 | PHX | ATL | 1587 | 6 | 11 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 141092 | 2000 | 1 | 11 | 2 | 1315.0 | 1320 | 1611.0 | 1610 | DL | 2518 | N320DL | 176.0 | 170.0 | 156.0 | 1.0 | -5.0 | ISP | MCO | 972 | 5 | 15 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 552272 | 2000 | 2 | 14 | 1 | 836.0 | 830 | 1010.0 | 1000 | WN | 572 | N92 | 94.0 | 90.0 | 76.0 | 10.0 | 6.0 | SAN | SMF | 480 | 4 | 14 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
| 2740709 | 2000 | 6 | 3 | 6 | 759.0 | 800 | 1112.0 | 1120 | DL | 2314 | N307DL | 193.0 | 200.0 | 165.0 | -8.0 | -1.0 | FLL | BOS | 1237 | 5 | 23 | 0 | NaN | 0 | NaN | NaN | NaN | NaN | NaN |
df_flight.describe()
| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | FlightNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Distance | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5683047.0 | 5.683047e+06 | 5.683047e+06 | 5.683047e+06 | 5.495557e+06 | 5.683047e+06 | 5.481303e+06 | 5.683047e+06 | 5.683047e+06 | 5.481303e+06 | 5.682778e+06 | 5.481303e+06 | 5.481303e+06 | 5.495557e+06 | 5.683047e+06 | 5.683047e+06 | 5.683047e+06 | 5.683047e+06 | 0.0 | 5.683047e+06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| mean | 2000.0 | 6.534263e+00 | 1.576453e+01 | 3.955945e+00 | 1.356324e+03 | 1.346648e+03 | 1.487695e+03 | 1.500130e+03 | 1.140978e+03 | 1.289264e+02 | 1.295373e+02 | 1.064597e+02 | 1.047289e+01 | 1.128068e+01 | 7.643082e+02 | 6.061730e+00 | 1.565887e+01 | 3.299110e-02 | NaN | 2.508162e-03 | NaN | NaN | NaN | NaN | NaN |
| std | 0.0 | 3.441708e+00 | 8.792426e+00 | 1.992334e+00 | 4.923077e+02 | 4.805208e+02 | 5.265580e+02 | 5.041544e+02 | 8.420707e+02 | 7.141845e+01 | 7.068724e+01 | 6.766066e+01 | 3.599997e+01 | 3.355167e+01 | 5.728213e+02 | 4.602505e+00 | 1.151564e+01 | 1.786133e-01 | NaN | 5.001871e-02 | NaN | NaN | NaN | NaN | NaN |
| min | 2000.0 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 1.000000e+00 | 0.000000e+00 | 1.000000e+00 | -6.000000e+00 | -2.700000e+01 | 1.000000e+00 | -1.298000e+03 | -9.900000e+02 | 1.100000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | NaN | 0.000000e+00 | NaN | NaN | NaN | NaN | NaN |
| 25% | 2000.0 | 4.000000e+00 | 8.000000e+00 | 2.000000e+00 | 9.290000e+02 | 9.250000e+02 | 1.108000e+03 | 1.115000e+03 | 4.690000e+02 | 7.500000e+01 | 7.500000e+01 | 5.600000e+01 | -7.000000e+00 | -2.000000e+00 | 3.360000e+02 | 4.000000e+00 | 1.000000e+01 | 0.000000e+00 | NaN | 0.000000e+00 | NaN | NaN | NaN | NaN | NaN |
| 50% | 2000.0 | 7.000000e+00 | 1.600000e+01 | 4.000000e+00 | 1.339000e+03 | 1.331000e+03 | 1.522000e+03 | 1.528000e+03 | 1.023000e+03 | 1.100000e+02 | 1.110000e+02 | 8.800000e+01 | 1.000000e+00 | 0.000000e+00 | 6.020000e+02 | 5.000000e+00 | 1.300000e+01 | 0.000000e+00 | NaN | 0.000000e+00 | NaN | NaN | NaN | NaN | NaN |
| 75% | 2000.0 | 1.000000e+01 | 2.300000e+01 | 6.000000e+00 | 1.750000e+03 | 1.735000e+03 | 1.928000e+03 | 1.928000e+03 | 1.706000e+03 | 1.620000e+02 | 1.620000e+02 | 1.370000e+02 | 1.400000e+01 | 1.000000e+01 | 1.005000e+03 | 7.000000e+00 | 1.900000e+01 | 0.000000e+00 | NaN | 0.000000e+00 | NaN | NaN | NaN | NaN | NaN |
| max | 2000.0 | 1.200000e+01 | 3.100000e+01 | 7.000000e+00 | 2.400000e+03 | 2.400000e+03 | 2.400000e+03 | 2.400000e+03 | 6.879000e+03 | 8.500000e+02 | 6.840000e+02 | 6.510000e+02 | 1.441000e+03 | 1.435000e+03 | 4.962000e+03 | 2.570000e+02 | 4.360000e+02 | 1.000000e+00 | NaN | 1.000000e+00 | NaN | NaN | NaN | NaN | NaN |
df_flight.shape
(5683047, 29)
df_flight.nunique()
Year                    1
Month                  12
DayofMonth             31
DayOfWeek               7
DepTime              1434
CRSDepTime           1183
ArrTime              1440
CRSArrTime           1376
UniqueCarrier          11
FlightNum            3131
TailNum              4036
ActualElapsedTime     682
CRSElapsedTime        499
AirTime               630
ArrDelay             1051
DepDelay             1031
Origin                206
Dest                  206
Distance             1110
TaxiIn                182
TaxiOut               345
Cancelled               2
CancellationCode        0
Diverted                2
CarrierDelay            0
WeatherDelay            0
NASDelay                0
SecurityDelay           0
LateAircraftDelay       0
dtype: int64
df_flight.isnull().sum()
Year                       0
Month                      0
DayofMonth                 0
DayOfWeek                  0
DepTime               187490
CRSDepTime                 0
ArrTime               201744
CRSArrTime                 0
UniqueCarrier              0
FlightNum                  0
TailNum                    0
ActualElapsedTime     201744
CRSElapsedTime           269
AirTime               201744
ArrDelay              201744
DepDelay              187490
Origin                     0
Dest                       0
Distance                   0
TaxiIn                     0
TaxiOut                    0
Cancelled                  0
CancellationCode     5683047
Diverted                   0
CarrierDelay         5683047
WeatherDelay         5683047
NASDelay             5683047
SecurityDelay        5683047
LateAircraftDelay    5683047
dtype: int64
sum(df_flight.duplicated())
0
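The null counts above show that six columns have no values at all, which is why they are dropped in the next step. As a minimal sketch (on a small made-up frame, not the real data; the name `toy` and its values are assumptions for illustration), fully empty columns can be found programmatically instead of being listed by hand:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for df_flight; the values are invented for illustration.
toy = pd.DataFrame({
    "DepDelay": [0.0, np.nan, 5.0],
    "CancellationCode": [np.nan, np.nan, np.nan],
    "CarrierDelay": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing carry no information.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean() * 100

# Drop the fully empty columns and keep the rest.
toy_clean = toy.drop(columns=empty_cols)

print(empty_cols)
print(missing_pct.round(1))
```

The same two expressions should work on `df_flight` unchanged, since they only rely on `isna()`.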
#Create a copy of the dataframe to hold the cleaned data
df_flight_clean= df_flight.copy()
#Drop the columns that will not be used (all of their values are missing)
df_flight_clean.drop(columns=["CancellationCode","CarrierDelay","WeatherDelay","NASDelay"
,"SecurityDelay","LateAircraftDelay"],inplace=True)
#Test
df_flight_clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5683047 entries, 0 to 5683046
Data columns (total 23 columns):
 #   Column             Dtype
---  ------             -----
 0   Year               int64
 1   Month              int64
 2   DayofMonth         int64
 3   DayOfWeek          int64
 4   DepTime            float64
 5   CRSDepTime         int64
 6   ArrTime            float64
 7   CRSArrTime         int64
 8   UniqueCarrier      object
 9   FlightNum          int64
 10  TailNum            object
 11  ActualElapsedTime  float64
 12  CRSElapsedTime     float64
 13  AirTime            float64
 14  ArrDelay           float64
 15  DepDelay           float64
 16  Origin             object
 17  Dest               object
 18  Distance           int64
 19  TaxiIn             int64
 20  TaxiOut            int64
 21  Cancelled          int64
 22  Diverted           int64
dtypes: float64(7), int64(12), object(4)
memory usage: 997.2+ MB
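#Deal with missing values in some columns
#Any 'NaN' text placeholder becomes a real NaN, so the columns stay numeric;
#values that are already real NaN are left as they are
def missing_value(column):
    df_flight_clean[column] = df_flight_clean[column].replace(['NaN'], np.nan)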
missing_value('DepTime')
missing_value("ArrTime")
missing_value('ActualElapsedTime')
missing_value('CRSElapsedTime')
missing_value('AirTime')
missing_value('ArrDelay')
missing_value('DepDelay')
#Test
df_flight_clean.sample(5)
| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | TailNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Origin | Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5113502 | 2000 | 11 | 10 | 5 | 2055.0 | 2000 | 2232.0 | 2131 | UA | 1440 | N7291U | 157.0 | 151.0 | 129.0 | 61.0 | 55.0 | ORD | COS | 911 | 9 | 19 | 0 | 0 |
| 1564586 | 2000 | 4 | 28 | 5 | 1102.0 | 1105 | 1213.0 | 1230 | NW | 1067 | N9332 | 71.0 | 85.0 | 58.0 | -17.0 | -3.0 | MDW | MSP | 349 | 3 | 10 | 0 | 0 |
| 5466439 | 2000 | 12 | 2 | 6 | 941.0 | 945 | 1048.0 | 1052 | NW | 1004 | N763NC | 67.0 | 67.0 | 40.0 | -4.0 | -4.0 | IND | DTW | 231 | 11 | 16 | 0 | 0 |
| 2413753 | 2000 | 6 | 25 | 7 | 1958.0 | 1944 | 2303.0 | 2210 | UA | 397 | N822UA | 305.0 | 266.0 | 268.0 | 53.0 | 14.0 | ORD | SJC | 1829 | 6 | 31 | 0 | 0 |
| 4641070 | 2000 | 10 | 29 | 7 | 2358.0 | 2355 | 503.0 | 513 | DL | 1946 | N543DA | 185.0 | 198.0 | 168.0 | -10.0 | 3.0 | SLC | ATL | 1589 | 5 | 12 | 0 | 0 |
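The missing entries in these float columns are real NaN values rather than the text 'NaN', and pandas treats the two differently. The sketch below uses a small made-up Series (the name `toy` and its values are assumptions for illustration) to show that replacing the string leaves real missing values untouched, while `fillna` acts on them:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a column such as DepTime.
toy = pd.Series([1647.0, np.nan, 842.0], name="DepTime")

# Replacing the *string* 'NaN' does not touch a real missing value.
after_replace = toy.replace("NaN", " ")
print(after_replace.isna().sum())

# fillna works on real missing values (-1 is an arbitrary marker here).
filled = toy.fillna(-1)
print(filled.isna().sum())
```

Whether to fill, drop, or keep the missing values depends on the analysis; keeping them as NaN lets pandas skip them in aggregations such as `mean()`.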
#Change the data type of some columns
def change_datatype_float(name):
    df_flight_clean[name] = df_flight_clean[name].astype(float)
change_datatype_float('CRSArrTime')
df_flight_clean['Cancelled'] = df_flight_clean['Cancelled'].astype(bool)
df_flight_clean['Diverted'] = df_flight_clean['Diverted'].astype(bool)
#change column name
df_flight_clean.rename(columns={'Dest': 'Airport_Dest'}, inplace = True)
#Test
df_flight_clean.sample(5)
| | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | FlightNum | TailNum | ActualElapsedTime | CRSElapsedTime | AirTime | ArrDelay | DepDelay | Origin | Airport_Dest | Distance | TaxiIn | TaxiOut | Cancelled | Diverted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2914341 | 2000 | 7 | 25 | 2 | 1022.0 | 1005 | 1119.0 | 1105.0 | WN | 345 | N314 | 57.0 | 60.0 | 41.0 | 14.0 | 17.0 | MDW | STL | 251 | 4 | 12 | False | False |
| 909267 | 2000 | 2 | 14 | 1 | 727.0 | 700 | 852.0 | 829.0 | UA | 243 | N407UA | 145.0 | 149.0 | 130.0 | 23.0 | 27.0 | ORD | DEN | 888 | 6 | 9 | False | False |
| 3940346 | 2000 | 9 | 22 | 5 | 1455.0 | 1455 | 1737.0 | 1736.0 | US | 2603 | N271AU | 162.0 | 161.0 | 140.0 | 1.0 | 0.0 | BWI | MIA | 946 | 7 | 15 | False | False |
| 75776 | 2000 | 1 | 9 | 7 | 1211.0 | 1205 | 1641.0 | 1703.0 | CO | 743 | N24212 | 210.0 | 238.0 | 189.0 | -22.0 | 6.0 | EWR | SJU | 1608 | 7 | 14 | False | False |
| 4171514 | 2000 | 9 | 8 | 5 | 1605.0 | 1605 | 1615.0 | 1615.0 | WN | 115 | N739 | 70.0 | 70.0 | 60.0 | 0.0 | 0.0 | CMH | BNA | 338 | 5 | 5 | False | False |
We have 23 columns: 1-Year, 2-Month, 3-DayofMonth,
4-DayOfWeek, 5-DepTime, 6-CRSDepTime, 7-ArrTime,
8-CRSArrTime, 9-UniqueCarrier, 10-FlightNum, 11-TailNum,
12-ActualElapsedTime, 13-CRSElapsedTime, 14-AirTime,
15-ArrDelay, 16-DepDelay, 17-Origin, 18-Airport_Dest (renamed from Dest), 19-Distance,
20-TaxiIn, 21-TaxiOut, 22-Cancelled, 23-Diverted. The RangeIndex has 5683047 entries.
I am interested in columns such as FlightNum, Month, DayOfWeek, Distance, ArrDelay, DepDelay, Cancelled, and UniqueCarrier.
I think the features that will help support my investigation are FlightNum, Distance, Cancelled, UniqueCarrier, Month, DayOfWeek, ArrDelay, and DepDelay.
# save cleaned data
df_flight_clean.to_csv('Flights_Dataset_2000_cleaned.csv', index=False)
#Number of Flights per Days.
plt.figure(figsize = [9,6])
base_color = sb.color_palette()[0]
sb.countplot(x='DayOfWeek',data=df_flight_clean,color=base_color); plt.grid(axis='y', alpha=0.10)
plt.ylabel('count',fontsize=13); plt.xlabel('Days' ,fontsize=13 ); plt.title("Number of Flights per Days", fontsize=20);
The chart shows that the number of flights is about the same on every day of the week, except for day 6, which has fewer flights than the other days.
#Number of Flights per month.
plt.figure(figsize = [9,6])
Months_of_flight='Month'
plt.hist(data= df_flight_clean, rwidth=0.69 , x =Months_of_flight , bins=np.arange(1,14),color=base_color), plt.grid(axis='y', alpha=0.17)
plt.ylabel('count',fontsize=14); plt.xlabel('Months' ,fontsize=14 ),plt.title("Number of Flights per month", fontsize=20)
plt.show()
The plot shows that months 3 and 8 have more flights than the other months of the year.
#The rate of cancelled for all Flight.
plt.figure(figsize = [9,6])
base_color = sb.color_palette()[0]
data_of_cancelled='Cancelled'
sb.countplot(x=data_of_cancelled ,data=df_flight_clean,color=base_color), plt.grid(axis='y', alpha=0.17);
plt.ylabel('count',fontsize=14); plt.xlabel('cancelled' ,fontsize=14 ),plt.title("The rate of cancelled for all Flight", fontsize=23);
The plot shows that the number of cancelled flights is very low compared with the number of flights that were not cancelled.
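The count plot can be backed by a single number. The `describe()` output earlier reports a mean of about 0.033 for Cancelled, which corresponds to roughly 3.3% of flights. A minimal sketch of the same calculation on a made-up boolean column (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-in for df_flight_clean['Cancelled'].
toy = pd.DataFrame({"Cancelled": [False, False, True, False]})

# The mean of a boolean column is the fraction of True values.
cancel_rate = toy["Cancelled"].mean()

# value_counts(normalize=True) gives both shares at once.
shares = toy["Cancelled"].value_counts(normalize=True)

print(f"{cancel_rate:.1%}")  # prints 25.0%
print(shares)
```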
#The Unique Carrier.
plt.figure(figsize = [11.69,8.27])
base_color = sb.color_palette()[0]
sb.countplot( x='UniqueCarrier',color=base_color, data=df_flight_clean), plt.grid(axis='y', alpha=0.17);
plt.ylabel('count',fontsize=14); plt.xlabel('Carrier' ,fontsize=14 ),plt.title("The Unique Carrier", fontsize=23);
The plot shows that the most common carriers are WN and DL.
The distributions looked as expected. In this part we examined the number of flights per day, the number of flights per month, the number of cancelled flights, and the number of flights per unique carrier. Yes, I needed to convert the Cancelled column to boolean.
No, none of the distributions were unusual.
#Max Number of Flight Vs Months
plt.figure(figsize = [11.69,8.27])
base_color = sb.color_palette()[0]
Flight_Month=df_flight_clean.groupby(['Month'], as_index=False)['FlightNum'].max()
sb.pointplot(data= Flight_Month ,y ='FlightNum', x='Month',color=base_color), plt.grid(axis='y', alpha=0.17);
plt.title('Max Number of Flight Vs Months ', fontsize=23);plt.ylabel('Number of Flights',fontsize=14); plt.xlabel('Months' ,fontsize=14 );
This plot looks at how the maximum FlightNum value varies by month. From month 7 to the end of the year the maximum is at its highest. Note that FlightNum is a flight identifier, so the plot shows the largest flight number used in each month rather than a count of flights.
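Because FlightNum is an identifier, its maximum per month and the number of flights per month are different quantities. The sketch below, on a small made-up frame (the name `toy` and its values are assumptions for illustration), computes both so the difference is visible:

```python
import pandas as pd

# Made-up stand-in for df_flight_clean.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2],
    "FlightNum": [154, 609, 154, 2614, 305],
})

# Largest flight number seen in each month (what the plot above uses).
max_flight_num = toy.groupby("Month")["FlightNum"].max()

# Number of flights (rows) in each month.
flights_per_month = toy.groupby("Month").size()

print(max_flight_num)
print(flights_per_month)
```

In this toy data month 2 has the larger maximum flight number but fewer flights, so the two measures can rank the months differently.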
#Max Distance Vs UniqueCarrier && #Min Distance Vs UniqueCarrier
plt.figure(figsize = [11.69,8.27])
base_color = sb.color_palette()[0]
#Max Distance Vs UniqueCarrier
plt.subplot(1, 2, 1)
Max_Dist_Carr=df_flight_clean.groupby(['UniqueCarrier'], as_index=False)['Distance'].max()
sb.barplot(color = base_color,data= Max_Dist_Carr ,y ='Distance', x='UniqueCarrier'), plt.grid(axis='y', alpha=0.17);
plt.ylabel('Distance(miles)',fontsize=14); plt.xlabel('UniqueCarrier' ,fontsize=14 ); plt.title('Max Distance Vs UniqueCarrier', fontsize=17)
#Min Distance Vs UniqueCarrier
plt.subplot(1, 2, 2)
Min_Dist_Carr=df_flight_clean.groupby(['UniqueCarrier'], as_index=False)['Distance'].min()
sb.barplot(color = base_color, data= Min_Dist_Carr ,y ='Distance', x='UniqueCarrier'), plt.grid(axis='y', alpha=0.17);
plt.ylabel('Distance(miles)',fontsize=14); plt.xlabel('UniqueCarrier' ,fontsize=14 ); plt.title('Min Distance Vs UniqueCarrier', fontsize=17);
These plots show which carrier flies the longest and which flies the shortest distance. The carrier with the maximum distance is CO, and the carrier with the minimum distance is AA.
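The carriers with the longest and shortest flights can also be read off directly instead of from the bars. A minimal sketch, assuming a small made-up frame with hypothetical carrier codes (`AAA`, `BBB`, `CCC` and all distances are invented for illustration):

```python
import pandas as pd

# Made-up stand-in for df_flight_clean; carrier codes are hypothetical.
toy = pd.DataFrame({
    "UniqueCarrier": ["AAA", "AAA", "BBB", "BBB", "CCC"],
    "Distance": [3000, 200, 50, 1500, 400],
})

# Minimum and maximum distance per carrier in one pass.
dist_range = toy.groupby("UniqueCarrier")["Distance"].agg(["min", "max"])

# Carrier label of the overall longest and shortest flight.
longest = dist_range["max"].idxmax()
shortest = dist_range["min"].idxmin()

print(dist_range)
print(longest, shortest)
```

Using one `agg` call avoids grouping the large frame twice, which may matter with several million rows.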
#Cancelled Vs Months
plt.figure(figsize = [11.69,8.27])
sb.countplot(x='Month', hue='Cancelled',data=df_flight_clean),plt.grid(axis='y', alpha=0.17);
plt.ylabel('count',fontsize=14); plt.xlabel('Months' ,fontsize=14 ),plt.title('Cancelled Vs Months ', fontsize=23);
This plot compares cancellations across months to see which months stand out. Months 1 and 12 have more cancelled flights than the other months. Note that the plot shows counts, so the comparison is between numbers of cancellations rather than rates.
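Since the count plot compares absolute numbers, a cancellation rate per month may be a fairer comparison when months differ in total flights. A minimal sketch on a made-up frame (the name `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-in for df_flight_clean with a boolean Cancelled column.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 1, 6, 6, 6, 6],
    "Cancelled": [True, True, False, False, True, False, False, False],
})

# Flights, cancellations, and cancellation rate for each month.
summary = toy.groupby("Month")["Cancelled"].agg(
    flights="size",    # number of rows in the month
    cancelled="sum",   # True counts as 1
    rate="mean",       # fraction of True values
)

print(summary)
```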
#Cancelled Vs UniqueCarrier
plt.figure(figsize = [11.69,8.27])
sb.countplot(x='UniqueCarrier', hue='Cancelled',data=df_flight_clean), plt.grid(axis='y', alpha=0.17);
plt.ylabel('count',fontsize=14); plt.xlabel('UniqueCarrier' ,fontsize=14),plt.title('Cancelled Vs UniqueCarrier', fontsize=23);
This plot compares cancellations across carriers. UA has the most cancelled flights, while US, DL, and AA have about the same number of cancellations.
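Carriers differ a lot in how many flights they operate, so the share of cancelled flights per carrier can be a useful complement to the counts. The sketch below uses `pd.crosstab` with row normalization on a made-up frame (the carrier codes `AAA` and `BBB`, the labels `cancelled` and `flown`, and all values are assumptions for illustration):

```python
import pandas as pd

# Made-up stand-in for df_flight_clean; carrier codes are hypothetical.
toy = pd.DataFrame({
    "UniqueCarrier": ["AAA", "AAA", "AAA", "AAA", "BBB", "BBB"],
    "Cancelled": [True, False, False, False, True, False],
})

# Readable labels make the resulting columns easy to select.
status = toy["Cancelled"].map({True: "cancelled", False: "flown"})

# Each row sums to 1: the share of cancelled and flown flights per carrier.
shares = pd.crosstab(toy["UniqueCarrier"], status, normalize="index")

# Carriers ordered from highest to lowest cancellation share.
rate = shares["cancelled"].sort_values(ascending=False)

print(rate)
```

In this toy data both carriers have one cancellation, but `BBB` has the higher share because it has fewer flights.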
In this part we examined four relationships: Max Number of Flight vs. Months (the maximum FlightNum per month) and Max/Min Distance vs. UniqueCarrier, which both show a strong relationship, and Cancelled vs. UniqueCarrier and Cancelled vs. Months, which both show a moderate relationship.
Yes: Max Number of Flight vs. Months, and Max Distance and Min Distance vs. UniqueCarrier, both show a strong relationship.
#UniqueCarrier Vs FlightNum & Dest & Cancelled
#Keep the selected carriers, then keep the selected destination airports within that subset
Carrier_AIR_Numf_Cn=df_flight_clean.loc[df_flight_clean['UniqueCarrier'].isin(['DL','US','CO','NW'])]
Carrier_AIR_Numf_Cn=Carrier_AIR_Numf_Cn.loc[Carrier_AIR_Numf_Cn['Airport_Dest'].isin(['ORD', "ATL", "DFW" ,"LAX"])]
px.histogram(Carrier_AIR_Numf_Cn,y="FlightNum", x="UniqueCarrier", facet_col="Airport_Dest",color="Cancelled")
#plt.yticks(np.arange(0,1,6879))